Round 1: Technical Screening (SQL + Pyspark)
↳ Given below 2 tables:
How many records will come up upon performing - inner join, outer join, left join, right join?
↳ Given an employees file having columns:
employee_id, employee_name, employee_salary, employee_dept, employee_manager
- Write a custom schema to read this file, given that separator is '|
- First 4 lines in the file are blank, 5th line is header. How would you read it?
- Suppose the values for column employee_name is of the format -> "Himani H Sharma". Derive three columns from it - first_name, middle_name, last_name
- Find all employees who have salaries greater than their manager's salary.
- Find the third highest salary in the whole table, and then per department.
↳ What's the difference between persist() & cache()?
↳ What's the difference between map() & flatMap()?
↳ What kind of AWS services have you used at your job?
Round 2: Client Interview Round (Pyspark + SQL)
↳ What kind of projects have you worked on so far? Describe the architecture along with full/incremental load strategies, and how data validation is performed.
↳ You are working on a data pipeline that processes user activity logs from a mobile app. The logs are stored in JSON format in a data lake (e.g., S3 or ADLS). Each record contains the following fields:
- user_id (string)
- event_type (string: e.g., "login", "purchase", "logout")
- event_timestamp (timestamp)
- device_type (string: e.g., "iOS", "Android")
1. Read the JSON file and filter out all the "purchase" event types.
2. For each user, find out the difference between maximum & minimum purchase timestamps.
↳ Below table has users & mobile numbers:
↳ Write Spark Program to generate count of distinct mobile numbers for each user.
Expected output:
↳ Given tables - Customers (id, name) & Orders (orderid, customer id, amount)
Return top 3 customers by total spending along with names and total spent.
Interviewer instruction: Use CTE, join & window function to solve the problem.
↳ Given a table having below cricket teams:
Find out all possible combinations of match fixtures without any repetitions.
For example - Ind vs NZ & NZ vs Ind are considered to be the same match fixture. Only one combination of this should be in the output.
↳ In windowing functions, which clause is mandatory to have? partition by or order by?
↳ How would you determine the number of jobs, tasks & stages given a sequence of transformations like - reading a file, applying union transformation, group by followed by writing the output to a table.
↳ How would 10 GB of data be processed in spark?
↳ Is read() an action or transformation?
↳ Are you aware of SCD? Describe SCD-1, SCD-2 & SCD-3, and whether you've implemented them at your job.
Round 3: HR round
↳ Salary negotiation, job location preference discussion.
⭐ Cracked all rounds, got the job offer!